5420 Anomaly Detection | Assignment 2 - Joyce Ng (jn2901)¶
Credit Card Fraud Detection Modeling¶
The Office of Management and Enterprise Services in the State of Oklahoma has made its purchase credit card transactions available. This dataset contains information on purchases made through the purchase card programs administered by the state and higher education institutions. This analysis will utilized Histogram Based Outlier Score and Empirical Cumulative Distribution-based Outlier Detection to identify anomalies.
Table of Contents¶
- Section 1: Data Preparation
- Section 2: EDA
- Section 3: Feature Engineering
- Section 4: Histogram-Based Outlier Score (HBOS)
- Section 5: Empirical Cumulative Distribution-based Outlier Detection (ECOD)
- Section 6: Models Predictions Comparison
- Section 7: Conclusion
Section 1: Data Preparation ¶
1.1 Load Libraries and Dataset ¶
# Load libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve,roc_auc_score, confusion_matrix, f1_score, accuracy_score, make_scorer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from pyod.utils.data import generate_data
from pyod.models.hbos import HBOS
from pyod.models.combination import aom, moa, average, maximization
from pyod.utils.utility import standardizer
from pyod.models.ecod import ECOD
# Visiualization
import plotly.express as px
import plotly.graph_objs as go
import plotly.subplots as sp
from plotly.subplots import make_subplots
import plotly.figure_factory as ff
import plotly.io as pio
from IPython.display import display
from plotly.offline import init_notebook_mode
init_notebook_mode(connected=True)
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings("ignore")
df = pd.read_csv('/Users/Joyce630/Desktop/Columbia/5420 Anomaly Detection/Assignments/2 - Credit Card/purchase_credit_card.csv')
df.head()
| Year-Month | Agency Number | Agency Name | Cardholder Last Name | Cardholder First Initial | Description | Amount | Vendor | Transaction Date | Posted Date | Merchant Category Code (MCC) | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 201307 | 1000 | OKLAHOMA STATE UNIVERSITY | Mason | C | GENERAL PURCHASE | 890.00 | NACAS | 07/30/2013 12:00:00 AM | 07/31/2013 12:00:00 AM | CHARITABLE AND SOCIAL SERVICE ORGANIZATIONS |
| 1 | 201307 | 1000 | OKLAHOMA STATE UNIVERSITY | Mason | C | ROOM CHARGES | 368.96 | SHERATON HOTEL | 07/30/2013 12:00:00 AM | 07/31/2013 12:00:00 AM | SHERATON |
| 2 | 201307 | 1000 | OKLAHOMA STATE UNIVERSITY | Massey | J | GENERAL PURCHASE | 165.82 | SEARS.COM 9300 | 07/29/2013 12:00:00 AM | 07/31/2013 12:00:00 AM | DIRCT MARKETING/DIRCT MARKETERS--NOT ELSEWHERE... |
| 3 | 201307 | 1000 | OKLAHOMA STATE UNIVERSITY | Massey | T | GENERAL PURCHASE | 96.39 | WAL-MART #0137 | 07/30/2013 12:00:00 AM | 07/31/2013 12:00:00 AM | GROCERY STORES,AND SUPERMARKETS |
| 4 | 201307 | 1000 | OKLAHOMA STATE UNIVERSITY | Mauro-Herrera | M | HAMMERMILL COPY PLUS COPY EA | 125.96 | STAPLES DIRECT | 07/30/2013 12:00:00 AM | 07/31/2013 12:00:00 AM | STATIONERY, OFFICE SUPPLIES, PRINTING AND WRIT... |
# Check dimensions of the dataset, we found it has 442,458 rows and 11 columns
print(df.shape, "\n") # Check the shape of df
print(df.columns, "\n") # Check column names
print(df.info(), "\n") # Check info of the df
df.describe() # Get the Simple Summary Statistics
(442458, 11)
Index(['Year-Month', 'Agency Number', 'Agency Name', 'Cardholder Last Name',
'Cardholder First Initial', 'Description', 'Amount', 'Vendor',
'Transaction Date', 'Posted Date', 'Merchant Category Code (MCC)'],
dtype='object')
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 442458 entries, 0 to 442457
Data columns (total 11 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Year-Month 442458 non-null int64
1 Agency Number 442458 non-null int64
2 Agency Name 442458 non-null object
3 Cardholder Last Name 442458 non-null object
4 Cardholder First Initial 442458 non-null object
5 Description 442458 non-null object
6 Amount 442458 non-null float64
7 Vendor 442458 non-null object
8 Transaction Date 442458 non-null object
9 Posted Date 442458 non-null object
10 Merchant Category Code (MCC) 442458 non-null object
dtypes: float64(1), int64(2), object(8)
memory usage: 37.1+ MB
None
| Year-Month | Agency Number | Amount | |
|---|---|---|---|
| count | 442458.000000 | 442458.000000 | 4.424580e+05 |
| mean | 201357.284375 | 42785.860353 | 4.249912e+02 |
| std | 47.107417 | 33378.461293 | 5.266509e+03 |
| min | 201307.000000 | 1000.000000 | -4.286304e+04 |
| 25% | 201309.000000 | 1000.000000 | 3.091000e+01 |
| 50% | 201401.000000 | 47700.000000 | 1.048900e+02 |
| 75% | 201404.000000 | 76000.000000 | 3.450000e+02 |
| max | 201406.000000 | 98000.000000 | 1.903858e+06 |
1.2: Check Missing Values & Change Columns Names and Data Type ¶
# Check for missing values
missing_values = df.isnull().sum()
missing_values
Year-Month 0 Agency Number 0 Agency Name 0 Cardholder Last Name 0 Cardholder First Initial 0 Description 0 Amount 0 Vendor 0 Transaction Date 0 Posted Date 0 Merchant Category Code (MCC) 0 dtype: int64
1.3: Create Data Fields ¶
# Change column names
df.columns = ['Year_Month', 'Agency_Number', 'Agency_Name', 'Cardholder_Last_Name',
'Cardholder_First_Initial', 'Description', 'Amount', 'Vendor', 'Transaction_Date',
'Posted_Date', 'Merchant_Category']
# Creating separate Year and Month columns for future feature engineering
# Converting Transaction_Date and Posted_date from Python Object data type to datetime
df['Transaction_Date']=pd.to_datetime(df['Transaction_Date'])
df['Posted_Date']=pd.to_datetime(df['Posted_Date'])
df['Year_Month'].dtype # Check data type for information purpose
df['Year_Month']=pd.to_datetime(df['Year_Month'], format='%Y%m')
df['Year'] = df['Year_Month'].dt.year
df['Month'] = df['Year_Month'].dt.month
#Creating 'Week_Number' from 'Transaction_Date'
df['Week_Number'] = df['Transaction_Date'].dt.isocalendar().week
#Creating 'Day_of_Week' from 'Transaction_Date'
dayOfWeek={0:'Monday', 1:'Tuesday', 2:'Wednesday', 3:'Thursday', 4:'Friday', 5:'Saturday', 6:'Sunday'}
df['Day_of_Week'] = df['Transaction_Date'].dt.dayofweek.map(dayOfWeek)
df['Day_of_Week'].value_counts()
Day_of_Week Wednesday 89644 Thursday 87537 Tuesday 86919 Friday 79917 Monday 61809 Saturday 26307 Sunday 10325 Name: count, dtype: int64
df.head()
| Year_Month | Agency_Number | Agency_Name | Cardholder_Last_Name | Cardholder_First_Initial | Description | Amount | Vendor | Transaction_Date | Posted_Date | Merchant_Category | Year | Month | Week_Number | Day_of_Week | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2013-07-01 | 1000 | OKLAHOMA STATE UNIVERSITY | Mason | C | GENERAL PURCHASE | 890.00 | NACAS | 2013-07-30 | 2013-07-31 | CHARITABLE AND SOCIAL SERVICE ORGANIZATIONS | 2013 | 7 | 31 | Tuesday |
| 1 | 2013-07-01 | 1000 | OKLAHOMA STATE UNIVERSITY | Mason | C | ROOM CHARGES | 368.96 | SHERATON HOTEL | 2013-07-30 | 2013-07-31 | SHERATON | 2013 | 7 | 31 | Tuesday |
| 2 | 2013-07-01 | 1000 | OKLAHOMA STATE UNIVERSITY | Massey | J | GENERAL PURCHASE | 165.82 | SEARS.COM 9300 | 2013-07-29 | 2013-07-31 | DIRCT MARKETING/DIRCT MARKETERS--NOT ELSEWHERE... | 2013 | 7 | 31 | Monday |
| 3 | 2013-07-01 | 1000 | OKLAHOMA STATE UNIVERSITY | Massey | T | GENERAL PURCHASE | 96.39 | WAL-MART #0137 | 2013-07-30 | 2013-07-31 | GROCERY STORES,AND SUPERMARKETS | 2013 | 7 | 31 | Tuesday |
| 4 | 2013-07-01 | 1000 | OKLAHOMA STATE UNIVERSITY | Mauro-Herrera | M | HAMMERMILL COPY PLUS COPY EA | 125.96 | STAPLES DIRECT | 2013-07-30 | 2013-07-31 | STATIONERY, OFFICE SUPPLIES, PRINTING AND WRIT... | 2013 | 7 | 31 | Tuesday |
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 442458 entries, 0 to 442457 Data columns (total 15 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Year_Month 442458 non-null datetime64[ns] 1 Agency_Number 442458 non-null int64 2 Agency_Name 442458 non-null object 3 Cardholder_Last_Name 442458 non-null object 4 Cardholder_First_Initial 442458 non-null object 5 Description 442458 non-null object 6 Amount 442458 non-null float64 7 Vendor 442458 non-null object 8 Transaction_Date 442458 non-null datetime64[ns] 9 Posted_Date 442458 non-null datetime64[ns] 10 Merchant_Category 442458 non-null object 11 Year 442458 non-null int32 12 Month 442458 non-null int32 13 Week_Number 442458 non-null UInt32 14 Day_of_Week 442458 non-null object dtypes: UInt32(1), datetime64[ns](3), float64(1), int32(2), int64(1), object(7) memory usage: 46.0+ MB
# Summarize the count statistics by agency_name in df_count
df_count = df['Agency_Name'].value_counts()
df_count = pd.DataFrame(df_count).reset_index()
df_count.columns = ['Agency_Name','Count']
fig = px.bar(df_count, x='Agency_Name', y='Count', color = 'Agency_Name', width=1000, height=400)
sns.set_theme(style="whitegrid")
sns.barplot(y="Agency_Name", x="Count", data=df_count[1:20])
plt.title('Distribution of Agency', fontweight='bold')
plt.xlabel('Count')
plt.ylabel('Agency Name')
plt.show()
Distribution of Transaction Amount¶
# Set the style of the seaborn plot
sns.set_style("whitegrid")
# Create a histogram for transaction amounts with a log scale
plt.figure(figsize=(10, 6))
ax = sns.histplot(data=df['Amount'], bins=100, kde=True, log_scale=True)
# Format x-axis tick labels to add comma formatting
ax.xaxis.set_major_formatter(plt.FuncFormatter(lambda x, _: '${:,.0f}'.format(x) if x >= 1 else '${:.2f}'.format(x)))
plt.title('Distribution of Transaction Amounts', fontweight='bold')
plt.xlabel('Transaction Amount')
plt.ylabel('Frequency')
plt.savefig('Distribution of Transaction Amount.png', transparent=True)
# Display the plot
plt.show()
Interaction of Transaction Amounts by Agency and Merchant Category¶
# Calculate the total transaction amount for each Agency and Merchant Category
df_grouped = df.groupby(['Agency_Name', 'Merchant_Category'])['Amount'].sum().reset_index()
# Take top 40 agencies for a clearer visualization
df_grouped = df_grouped.sort_values(by='Amount', ascending=False)
df_grouped = df_grouped.head(40)
# Sort DataFrame by Amount
df_grouped = df_grouped.sort_values(by='Amount', ascending=True)
# Create the scatter plot
fig = px.scatter(df_grouped,
x="Agency_Name",
y="Merchant_Category",
color="Amount",
size="Amount",
title='Interaction of Transaction Amounts by Agency and Merchant Category',
width=1000, height=600,
color_continuous_scale='Viridis_r'
)
# Customize layout to edit legend title and set background to white
fig.update_layout(
coloraxis_colorbar=dict(
title="Amount ($)" # Legend title
),
title={
'text': 'Interaction of Transaction Amounts by Agency and Merchant Category',
'y':0.9,
'x':0.5,
'xanchor': 'center',
'yanchor': 'top'
},
paper_bgcolor="rgba(0,0,0,0)", # Set the background color of the paper (outer) to transparent
plot_bgcolor='white' # Set the background color of the plotting area (inner) to white
)
# Customize x-axis
fig.update_xaxes(
title = 'Agency Name',
mirror=True,
ticks='outside',
showline=True,
linecolor='lightgrey',
gridcolor='lightgrey',
showgrid=False,
tickangle=-30
)
# Customize y-axis
fig.update_yaxes(
title = 'Merchant Category',
mirror=True,
ticks='outside',
showline=True,
linecolor='lightgrey',
gridcolor='lightgrey',
showgrid=False
)
fig.write_image("Interactions of two or three variables.png")
# Show the plot
fig.show()
Section 3: Feature Engineering ¶
- Avg amount by agency, merchant
- Avg amount by agency, merchant in last 30 days
- Avg amount by agency, merchant in last 60 days
- Avg amount by agency, merchant in last 90 days
- Max amount by agency, merchant in last 30 days
- Max amount by agency, merchant in last 60 days
- Max amount by agency, merchant in last 90 days
- Avg transaction by agency, merchant
- Avg transaction by agency, merchant in last 30 days
- Avg transaction by agency, merchant in last 60 days
- Avg transaction by agency, merchant in last 90 days
- Avg among by agency, vendor
- Avg amount by agency, vendor in last 30 days
- Avg amount by agency, vendor in last 60 days
- Avg amount by agency, vendor in last 90 days
Features mainly focused on 4 main categories: Average Amount by Agency and Merchant, Maximum Amount by Agency and Merchant, Average Transaction Count by Agency and Merchant, Average Amount by Agency and Vendor. Within each category, we will look at the overall transaction data and across different time period i.e. in last 30, 60, 90 days to identify spending patterns and potential anomalies.
3.1 Feature Benchmarks¶
# Define functions and variables for benchmark calculations
def calculate_avg_amount(data, group_by_cols):
return data.groupby(group_by_cols)['Amount'].mean().reset_index(name='Avg_Amount')
def calculate_max_amount(data, group_by_cols):
return data.groupby(group_by_cols)['Amount'].max().reset_index(name='Max_Amount')
def calculate_avg_transaction(data, group_by_cols):
return data.groupby(group_by_cols).size().reset_index(name='Avg_Transaction_Count')
def calculate_last_n_days(data, n):
cutoff_date = data['Transaction_Date'].max() - pd.Timedelta(days=n)
return data[data['Transaction_Date'] > cutoff_date]
# Define the group by columns
agency_merchant_cols = ['Agency_Name', 'Merchant_Category']
agency_vendor_cols = ['Agency_Name', 'Vendor']
# Calculate features
features = {}
# Feature 1: Average Amount by Agency, Merchant
features['avg_amount'] = calculate_avg_amount(df, agency_merchant_cols)
print("Feature 1: Average Amount by Agency, Merchant")
print(features['avg_amount'].head(),'\n')
# Feature 2: Average Amount by Agency, Merchant in last 30 days
features['avg_amount_30_days'] = calculate_avg_amount(calculate_last_n_days(df, 30), agency_merchant_cols)
print("Feature 2: Average Amount by Agency, Merchant in last 30 days")
print(features['avg_amount_30_days'].head(),'\n')
# Feature 3: Average Amount by Agency, Merchant in last 60 days
features['avg_amount_60_days'] = calculate_avg_amount(calculate_last_n_days(df, 60), agency_merchant_cols)
print("Feature 3: Average Amount by Agency, Merchant in last 60 days")
print(features['avg_amount_60_days'].head(),'\n')
# Feature 4: Average Amount by Agency, Merchant in last 90 days
features['avg_amount_90_days'] = calculate_avg_amount(calculate_last_n_days(df, 90), agency_merchant_cols)
print("Feature 4: Average Amount by Agency, Merchant in last 90 days")
print(features['avg_amount_90_days'].head(),'\n')
# Feature 5: Maximum Amount by Agency, Merchant in last 30 days
features['max_amount_30_days'] = calculate_max_amount(calculate_last_n_days(df, 30), agency_merchant_cols)
print("Feature 5: Maximum Amount by Agency, Merchant in last 30 days")
print(features['max_amount_30_days'].head(),'\n')
# Feature 6: Maximum Amount by Agency, Merchant in last 60 days
features['max_amount_60_days'] = calculate_max_amount(calculate_last_n_days(df, 60), agency_merchant_cols)
print("Feature 6: Maximum Amount by Agency, Merchant in last 60 days")
print(features['max_amount_60_days'].head(),'\n')
# Feature 7: Maximum Amount by Agency, Merchant in last 90 days
features['max_amount_90_days'] = calculate_max_amount(calculate_last_n_days(df, 90), agency_merchant_cols)
print("Feature 7: Maximum Amount by Agency, Merchant in last 90 days")
print(features['max_amount_90_days'].head(),'\n')
# Feature 8: Average Transaction by Agency, Merchant
features['avg_transaction'] = calculate_avg_transaction(df, agency_merchant_cols)
print("Feature 8: Average Transaction by Agency, Merchant")
print(features['avg_transaction'].head(),'\n')
# Feature 9: Average Transaction by Agency, Merchant in last 30 days
features['avg_transaction_30_days'] = calculate_avg_transaction(calculate_last_n_days(df, 30), agency_merchant_cols)
print("Feature 9: Average Transaction by Agency, Merchant in last 30 days")
print(features['avg_transaction_30_days'].head(),'\n')
# Feature 10: Average Transaction by Agency, Merchant in last 60 days
features['avg_transaction_60_days'] = calculate_avg_transaction(calculate_last_n_days(df, 60), agency_merchant_cols)
print("Feature 10: Average Transaction by Agency, Merchant in last 60 days")
print(features['avg_transaction_60_days'].head(),'\n')
# Feature 11: Average Transaction by Agency, Merchant in last 90 days
features['avg_transaction_90_days'] = calculate_avg_transaction(calculate_last_n_days(df, 90), agency_merchant_cols)
print("Feature 11: Average Transaction by Agency, Merchant in last 90 days")
print(features['avg_transaction_90_days'].head(),'\n')
# Feature 12: Average Amount by Agency, Vendor
features['avg_amount_vendor'] = calculate_avg_amount(df, agency_vendor_cols)
print("Feature 12: Average Amount by Agency, Vendor")
print(features['avg_amount_vendor'].head(),'\n')
# Feature 13: Average Amount by Agency, Vendor in last 30 days
features['avg_amount_vendor_30_days'] = calculate_avg_amount(calculate_last_n_days(df, 30), agency_vendor_cols)
print("Feature 13: Average Amount by Agency, Vendor in last 30 days")
print(features['avg_amount_vendor_30_days'].head(),'\n')
# Feature 14: Average Amount by Agency, Vendor in last 60 days
features['avg_amount_vendor_60_days'] = calculate_avg_amount(calculate_last_n_days(df, 60), agency_vendor_cols)
print("Feature 14: Average Amount by Agency, Vendor in last 60 days")
print(features['avg_amount_vendor_60_days'].head(),'\n')
# Feature 15: Average Amount by Agency, Vendor in last 90 days
features['avg_amount_vendor_90_days'] = calculate_avg_amount(calculate_last_n_days(df, 90), agency_vendor_cols)
print("Feature 15: Average Amount by Agency, Vendor in last 90 days")
print(features['avg_amount_vendor_90_days'].head())
Feature 1: Average Amount by Agency, Merchant
Agency_Name \
0 ARDMORE HIGHER EDUCATION CENTER
1 ARDMORE HIGHER EDUCATION CENTER
2 ARDMORE HIGHER EDUCATION CENTER
3 ARDMORE HIGHER EDUCATION CENTER
4 ARDMORE HIGHER EDUCATION CENTER
Merchant_Category Avg_Amount
0 BUSINESS SERVICES NOT ELSEWHERE CLASSIFIED 115.120000
1 CATALOG MERCHANTS -53.344286
2 COMPUTER NETWORK/INFORMATION SERVICES 18.340000
3 DRUG STORES AND PHARMACIES 153.450000
4 GROCERY STORES,AND SUPERMARKETS 74.525455
Feature 2: Average Amount by Agency, Merchant in last 30 days
Agency_Name Merchant_Category \
0 ATTORNEY GENERAL ADVERTISING SERVICES
1 ATTORNEY GENERAL AMERICAN AIRLINES
2 ATTORNEY GENERAL BUSINESS SERVICES NOT ELSEWHERE CLASSIFIED
3 ATTORNEY GENERAL CABLE, SATELLITE, AND OTHER PAY TELEVISION AND...
4 ATTORNEY GENERAL CHARITABLE AND SOCIAL SERVICE ORGANIZATIONS
Avg_Amount
0 1040.000000
1 462.781429
2 436.428571
3 24.590000
4 875.000000
Feature 3: Average Amount by Agency, Merchant in last 60 days
Agency_Name Merchant_Category \
0 ATTORNEY GENERAL ADVERTISING SERVICES
1 ATTORNEY GENERAL AMERICAN AIRLINES
2 ATTORNEY GENERAL BOOK STORES
3 ATTORNEY GENERAL BUSINESS SERVICES NOT ELSEWHERE CLASSIFIED
4 ATTORNEY GENERAL CABLE, SATELLITE, AND OTHER PAY TELEVISION AND...
Avg_Amount
0 1040.000000
1 465.248125
2 123.225000
3 332.034545
4 118.990000
Feature 4: Average Amount by Agency, Merchant in last 90 days
Agency_Name Merchant_Category Avg_Amount
0 ATTORNEY GENERAL ADVERTISING SERVICES 1040.000000
1 ATTORNEY GENERAL AMERICAN AIRLINES 417.765667
2 ATTORNEY GENERAL AUTOMOTIVE PARTS AND ACCESSORIES STORES 97.215000
3 ATTORNEY GENERAL BOOK STORES 123.225000
4 ATTORNEY GENERAL BUSINESS SERVICES NOT ELSEWHERE CLASSIFIED 299.336923
Feature 5: Maximum Amount by Agency, Merchant in last 30 days
Agency_Name Merchant_Category \
0 ATTORNEY GENERAL ADVERTISING SERVICES
1 ATTORNEY GENERAL AMERICAN AIRLINES
2 ATTORNEY GENERAL BUSINESS SERVICES NOT ELSEWHERE CLASSIFIED
3 ATTORNEY GENERAL CABLE, SATELLITE, AND OTHER PAY TELEVISION AND...
4 ATTORNEY GENERAL CHARITABLE AND SOCIAL SERVICE ORGANIZATIONS
Max_Amount
0 1040.00
1 982.00
2 1000.00
3 24.59
4 1325.00
Feature 6: Maximum Amount by Agency, Merchant in last 60 days
Agency_Name Merchant_Category \
0 ATTORNEY GENERAL ADVERTISING SERVICES
1 ATTORNEY GENERAL AMERICAN AIRLINES
2 ATTORNEY GENERAL BOOK STORES
3 ATTORNEY GENERAL BUSINESS SERVICES NOT ELSEWHERE CLASSIFIED
4 ATTORNEY GENERAL CABLE, SATELLITE, AND OTHER PAY TELEVISION AND...
Max_Amount
0 1040.00
1 982.00
2 208.50
3 1000.00
4 306.66
Feature 7: Maximum Amount by Agency, Merchant in last 90 days
Agency_Name Merchant_Category Max_Amount
0 ATTORNEY GENERAL ADVERTISING SERVICES 1040.00
1 ATTORNEY GENERAL AMERICAN AIRLINES 982.00
2 ATTORNEY GENERAL AUTOMOTIVE PARTS AND ACCESSORIES STORES 186.17
3 ATTORNEY GENERAL BOOK STORES 208.50
4 ATTORNEY GENERAL BUSINESS SERVICES NOT ELSEWHERE CLASSIFIED 1000.00
Feature 8: Average Transaction by Agency, Merchant
Agency_Name \
0 ARDMORE HIGHER EDUCATION CENTER
1 ARDMORE HIGHER EDUCATION CENTER
2 ARDMORE HIGHER EDUCATION CENTER
3 ARDMORE HIGHER EDUCATION CENTER
4 ARDMORE HIGHER EDUCATION CENTER
Merchant_Category Avg_Transaction_Count
0 BUSINESS SERVICES NOT ELSEWHERE CLASSIFIED 1
1 CATALOG MERCHANTS 7
2 COMPUTER NETWORK/INFORMATION SERVICES 1
3 DRUG STORES AND PHARMACIES 1
4 GROCERY STORES,AND SUPERMARKETS 11
Feature 9: Average Transaction by Agency, Merchant in last 30 days
Agency_Name Merchant_Category \
0 ATTORNEY GENERAL ADVERTISING SERVICES
1 ATTORNEY GENERAL AMERICAN AIRLINES
2 ATTORNEY GENERAL BUSINESS SERVICES NOT ELSEWHERE CLASSIFIED
3 ATTORNEY GENERAL CABLE, SATELLITE, AND OTHER PAY TELEVISION AND...
4 ATTORNEY GENERAL CHARITABLE AND SOCIAL SERVICE ORGANIZATIONS
Avg_Transaction_Count
0 1
1 7
2 7
3 1
4 3
Feature 10: Average Transaction by Agency, Merchant in last 60 days
Agency_Name Merchant_Category \
0 ATTORNEY GENERAL ADVERTISING SERVICES
1 ATTORNEY GENERAL AMERICAN AIRLINES
2 ATTORNEY GENERAL BOOK STORES
3 ATTORNEY GENERAL BUSINESS SERVICES NOT ELSEWHERE CLASSIFIED
4 ATTORNEY GENERAL CABLE, SATELLITE, AND OTHER PAY TELEVISION AND...
Avg_Transaction_Count
0 1
1 16
2 2
3 11
4 3
Feature 11: Average Transaction by Agency, Merchant in last 90 days
Agency_Name Merchant_Category \
0 ATTORNEY GENERAL ADVERTISING SERVICES
1 ATTORNEY GENERAL AMERICAN AIRLINES
2 ATTORNEY GENERAL AUTOMOTIVE PARTS AND ACCESSORIES STORES
3 ATTORNEY GENERAL BOOK STORES
4 ATTORNEY GENERAL BUSINESS SERVICES NOT ELSEWHERE CLASSIFIED
Avg_Transaction_Count
0 1
1 30
2 2
3 2
4 13
Feature 12: Average Amount by Agency, Vendor
Agency_Name Vendor Avg_Amount
0 ARDMORE HIGHER EDUCATION CENTER AGRI PRODUCTS 92.500000
1 ARDMORE HIGHER EDUCATION CENTER CDW GOVERNMENT -53.344286
2 ARDMORE HIGHER EDUCATION CENTER COUNTY BUILDING CE 58.440000
3 ARDMORE HIGHER EDUCATION CENTER EMPIRE PAPER COMPANY 443.540000
4 ARDMORE HIGHER EDUCATION CENTER GODADDY.COM 18.340000
Feature 13: Average Amount by Agency, Vendor in last 30 days
Agency_Name Vendor Avg_Amount
0 ATTORNEY GENERAL ADI ASPEN PUBLISHERS 866.20
1 ATTORNEY GENERAL AMERICAN AI 0017450939750 -250.03
2 ATTORNEY GENERAL AMERICAN AI 0017456015142 982.00
3 ATTORNEY GENERAL AMERICAN AI 0017459356062 669.00
4 ATTORNEY GENERAL AMERICAN AI 0017459356079 687.50
Feature 14: Average Amount by Agency, Vendor in last 60 days
Agency_Name Vendor Avg_Amount
0 ATTORNEY GENERAL ADI ASPEN PUBLISHERS 866.200
1 ATTORNEY GENERAL AMAZON MKTPLACE PMTS 123.225
2 ATTORNEY GENERAL AMERICAN AI 0017450939750 205.985
3 ATTORNEY GENERAL AMERICAN AI 0017450939785 -361.000
4 ATTORNEY GENERAL AMERICAN AI 0017450939901 250.000
Feature 15: Average Amount by Agency, Vendor in last 90 days
Agency_Name Vendor Avg_Amount
0 ATTORNEY GENERAL ADI ASPEN PUBLISHERS 866.200
1 ATTORNEY GENERAL AMAZON MKTPLACE PMTS 123.225
2 ATTORNEY GENERAL AMERICAN AI 0017394725288 444.000
3 ATTORNEY GENERAL AMERICAN AI 0017395886798 673.000
4 ATTORNEY GENERAL AMERICAN AI 0017395886825 0.000
3.2 Feature Creation¶
# Calculate ratios between features
def calculate_ratios(features):
ratios = {}
# Define base features for ratio calculations (using overall avg amount as a baseline for simplicity)
baseline_avg_amount = features['avg_amount']['Avg_Amount'].mean()
baseline_transaction_count = features['avg_transaction']['Avg_Transaction_Count'].mean()
baseline_avg_amount_vendor = features['avg_amount_vendor']['Avg_Amount'].mean()
# Calculate ratios for average amounts
ratios['avg_amount_ratio'] = features['avg_amount'].copy()
if baseline_avg_amount != 0:
ratios['avg_amount_ratio']['Avg_Amount_Ratio'] = features['avg_amount']['Avg_Amount'] / baseline_avg_amount
else:
ratios['avg_amount_ratio']['Avg_Amount_Ratio'] = 0
ratios['avg_amount_30_days_ratio'] = features['avg_amount_30_days'].copy()
if baseline_avg_amount != 0:
ratios['avg_amount_30_days_ratio']['Avg_Amount_30_Days_Ratio'] = features['avg_amount_30_days']['Avg_Amount'] / baseline_avg_amount
else:
ratios['avg_amount_30_days_ratio']['Avg_Amount_30_Days_Ratio'] = 0
ratios['avg_amount_60_days_ratio'] = features['avg_amount_60_days'].copy()
if baseline_avg_amount != 0:
ratios['avg_amount_60_days_ratio']['Avg_Amount_60_Days_Ratio'] = features['avg_amount_60_days']['Avg_Amount'] / baseline_avg_amount
else:
ratios['avg_amount_60_days_ratio']['Avg_Amount_60_Days_Ratio'] = 0
ratios['avg_amount_90_days_ratio'] = features['avg_amount_90_days'].copy()
if baseline_avg_amount != 0:
ratios['avg_amount_90_days_ratio']['Avg_Amount_90_Days_Ratio'] = features['avg_amount_90_days']['Avg_Amount'] / baseline_avg_amount
else:
ratios['avg_amount_90_days_ratio']['Avg_Amount_90_Days_Ratio'] = 0
# Calculate ratios for maximum amounts
ratios['max_amount_30_days_ratio'] = features['max_amount_30_days'].copy()
if baseline_avg_amount != 0:
ratios['max_amount_30_days_ratio']['Max_Amount_30_Days_Ratio'] = features['max_amount_30_days']['Max_Amount'] / baseline_avg_amount
else:
ratios['max_amount_30_days_ratio']['Max_Amount_30_Days_Ratio'] = 0
ratios['max_amount_60_days_ratio'] = features['max_amount_60_days'].copy()
if baseline_avg_amount != 0:
ratios['max_amount_60_days_ratio']['Max_Amount_60_Days_Ratio'] = features['max_amount_60_days']['Max_Amount'] / baseline_avg_amount
else:
ratios['max_amount_60_days_ratio']['Max_Amount_60_Days_Ratio'] = 0
ratios['max_amount_90_days_ratio'] = features['max_amount_90_days'].copy()
if baseline_avg_amount != 0:
ratios['max_amount_90_days_ratio']['Max_Amount_90_Days_Ratio'] = features['max_amount_90_days']['Max_Amount'] / baseline_avg_amount
else:
ratios['max_amount_90_days_ratio']['Max_Amount_90_Days_Ratio'] = 0
# Calculate ratios for average transaction counts
ratios['avg_transaction_ratio'] = features['avg_transaction'].copy()
if baseline_transaction_count != 0:
ratios['avg_transaction_ratio']['Avg_Transaction_Ratio'] = features['avg_transaction']['Avg_Transaction_Count'] / baseline_transaction_count
else:
ratios['avg_transaction_ratio']['Avg_Transaction_Ratio'] = 0
ratios['avg_transaction_30_days_ratio'] = features['avg_transaction_30_days'].copy()
if baseline_transaction_count != 0:
ratios['avg_transaction_30_days_ratio']['Avg_Transaction_30_Days_Ratio'] = features['avg_transaction_30_days']['Avg_Transaction_Count'] / baseline_transaction_count
else:
ratios['avg_transaction_30_days_ratio']['Avg_Transaction_30_Days_Ratio'] = 0
ratios['avg_transaction_60_days_ratio'] = features['avg_transaction_60_days'].copy()
if baseline_transaction_count != 0:
ratios['avg_transaction_60_days_ratio']['Avg_Transaction_60_Days_Ratio'] = features['avg_transaction_60_days']['Avg_Transaction_Count'] / baseline_transaction_count
else:
ratios['avg_transaction_60_days_ratio']['Avg_Transaction_60_Days_Ratio'] = 0
ratios['avg_transaction_90_days_ratio'] = features['avg_transaction_90_days'].copy()
if baseline_transaction_count != 0:
ratios['avg_transaction_90_days_ratio']['Avg_Transaction_90_Days_Ratio'] = features['avg_transaction_90_days']['Avg_Transaction_Count'] / baseline_transaction_count
else:
ratios['avg_transaction_90_days_ratio']['Avg_Transaction_90_Days_Ratio'] = 0
# Calculate ratios for average amounts by vendor
ratios['avg_amount_vendor_ratio'] = features['avg_amount_vendor'].copy()
if baseline_avg_amount_vendor != 0:
ratios['avg_amount_vendor_ratio']['Avg_Amount_Vendor_Ratio'] = features['avg_amount_vendor']['Avg_Amount'] / baseline_avg_amount_vendor
else:
ratios['avg_amount_vendor_ratio']['Avg_Amount_Vendor_Ratio'] = 0
ratios['avg_amount_vendor_30_days_ratio'] = features['avg_amount_vendor_30_days'].copy()
if baseline_avg_amount_vendor != 0:
ratios['avg_amount_vendor_30_days_ratio']['Avg_Amount_Vendor_30_Days_Ratio'] = features['avg_amount_vendor_30_days']['Avg_Amount'] / baseline_avg_amount_vendor
else:
ratios['avg_amount_vendor_30_days_ratio']['Avg_Amount_Vendor_30_Days_Ratio'] = 0
ratios['avg_amount_vendor_60_days_ratio'] = features['avg_amount_vendor_60_days'].copy()
if baseline_avg_amount_vendor != 0:
ratios['avg_amount_vendor_60_days_ratio']['Avg_Amount_Vendor_60_Days_Ratio'] = features['avg_amount_vendor_60_days']['Avg_Amount'] / baseline_avg_amount_vendor
else:
ratios['avg_amount_vendor_60_days_ratio']['Avg_Amount_Vendor_60_Days_Ratio'] = 0
ratios['avg_amount_vendor_90_days_ratio'] = features['avg_amount_vendor_90_days'].copy()
if baseline_avg_amount_vendor != 0:
ratios['avg_amount_vendor_90_days_ratio']['Avg_Amount_Vendor_90_Days_Ratio'] = features['avg_amount_vendor_90_days']['Avg_Amount'] / baseline_avg_amount_vendor
else:
ratios['avg_amount_vendor_90_days_ratio']['Avg_Amount_Vendor_90_Days_Ratio'] = 0
return ratios
# Calculate and print ratios
ratios = calculate_ratios(features)
# Displaying ratios for validation
for key, value in ratios.items():
print(f"Ratios for {key}")
print(value.head(),'\n')
Ratios for avg_amount_ratio
Agency_Name \
0 ARDMORE HIGHER EDUCATION CENTER
1 ARDMORE HIGHER EDUCATION CENTER
2 ARDMORE HIGHER EDUCATION CENTER
3 ARDMORE HIGHER EDUCATION CENTER
4 ARDMORE HIGHER EDUCATION CENTER
Merchant_Category Avg_Amount Avg_Amount_Ratio
0 BUSINESS SERVICES NOT ELSEWHERE CLASSIFIED 115.120000 0.218589
1 CATALOG MERCHANTS -53.344286 -0.101290
2 COMPUTER NETWORK/INFORMATION SERVICES 18.340000 0.034824
3 DRUG STORES AND PHARMACIES 153.450000 0.291370
4 GROCERY STORES,AND SUPERMARKETS 74.525455 0.141509
Ratios for avg_amount_30_days_ratio
Agency_Name Merchant_Category \
0 ATTORNEY GENERAL ADVERTISING SERVICES
1 ATTORNEY GENERAL AMERICAN AIRLINES
2 ATTORNEY GENERAL BUSINESS SERVICES NOT ELSEWHERE CLASSIFIED
3 ATTORNEY GENERAL CABLE, SATELLITE, AND OTHER PAY TELEVISION AND...
4 ATTORNEY GENERAL CHARITABLE AND SOCIAL SERVICE ORGANIZATIONS
Avg_Amount Avg_Amount_30_Days_Ratio
0 1040.000000 1.974746
1 462.781429 0.878727
2 436.428571 0.828688
3 24.590000 0.046691
4 875.000000 1.661445
Ratios for avg_amount_60_days_ratio
Agency_Name Merchant_Category \
0 ATTORNEY GENERAL ADVERTISING SERVICES
1 ATTORNEY GENERAL AMERICAN AIRLINES
2 ATTORNEY GENERAL BOOK STORES
3 ATTORNEY GENERAL BUSINESS SERVICES NOT ELSEWHERE CLASSIFIED
4 ATTORNEY GENERAL CABLE, SATELLITE, AND OTHER PAY TELEVISION AND...
Avg_Amount Avg_Amount_60_Days_Ratio
0 1040.000000 1.974746
1 465.248125 0.883410
2 123.225000 0.233979
3 332.034545 0.630465
4 118.990000 0.225938
Ratios for avg_amount_90_days_ratio
Agency_Name Merchant_Category Avg_Amount \
0 ATTORNEY GENERAL ADVERTISING SERVICES 1040.000000
1 ATTORNEY GENERAL AMERICAN AIRLINES 417.765667
2 ATTORNEY GENERAL AUTOMOTIVE PARTS AND ACCESSORIES STORES 97.215000
3 ATTORNEY GENERAL BOOK STORES 123.225000
4 ATTORNEY GENERAL BUSINESS SERVICES NOT ELSEWHERE CLASSIFIED 299.336923
Avg_Amount_90_Days_Ratio
0 1.974746
1 0.793251
2 0.184591
3 0.233979
4 0.568379
Ratios for max_amount_30_days_ratio
Agency_Name Merchant_Category \
0 ATTORNEY GENERAL ADVERTISING SERVICES
1 ATTORNEY GENERAL AMERICAN AIRLINES
2 ATTORNEY GENERAL BUSINESS SERVICES NOT ELSEWHERE CLASSIFIED
3 ATTORNEY GENERAL CABLE, SATELLITE, AND OTHER PAY TELEVISION AND...
4 ATTORNEY GENERAL CHARITABLE AND SOCIAL SERVICE ORGANIZATIONS
Max_Amount Max_Amount_30_Days_Ratio
0 1040.00 1.974746
1 982.00 1.864616
2 1000.00 1.898794
3 24.59 0.046691
4 1325.00 2.515902
Ratios for max_amount_60_days_ratio
Agency_Name Merchant_Category \
0 ATTORNEY GENERAL ADVERTISING SERVICES
1 ATTORNEY GENERAL AMERICAN AIRLINES
2 ATTORNEY GENERAL BOOK STORES
3 ATTORNEY GENERAL BUSINESS SERVICES NOT ELSEWHERE CLASSIFIED
4 ATTORNEY GENERAL CABLE, SATELLITE, AND OTHER PAY TELEVISION AND...
Max_Amount Max_Amount_60_Days_Ratio
0 1040.00 1.974746
1 982.00 1.864616
2 208.50 0.395899
3 1000.00 1.898794
4 306.66 0.582284
Ratios for max_amount_90_days_ratio
Agency_Name Merchant_Category Max_Amount \
0 ATTORNEY GENERAL ADVERTISING SERVICES 1040.00
1 ATTORNEY GENERAL AMERICAN AIRLINES 982.00
2 ATTORNEY GENERAL AUTOMOTIVE PARTS AND ACCESSORIES STORES 186.17
3 ATTORNEY GENERAL BOOK STORES 208.50
4 ATTORNEY GENERAL BUSINESS SERVICES NOT ELSEWHERE CLASSIFIED 1000.00
Max_Amount_90_Days_Ratio
0 1.974746
1 1.864616
2 0.353499
3 0.395899
4 1.898794
Ratios for avg_transaction_ratio
Agency_Name \
0 ARDMORE HIGHER EDUCATION CENTER
1 ARDMORE HIGHER EDUCATION CENTER
2 ARDMORE HIGHER EDUCATION CENTER
3 ARDMORE HIGHER EDUCATION CENTER
4 ARDMORE HIGHER EDUCATION CENTER
Merchant_Category Avg_Transaction_Count \
0 BUSINESS SERVICES NOT ELSEWHERE CLASSIFIED 1
1 CATALOG MERCHANTS 7
2 COMPUTER NETWORK/INFORMATION SERVICES 1
3 DRUG STORES AND PHARMACIES 1
4 GROCERY STORES,AND SUPERMARKETS 11
Avg_Transaction_Ratio
0 0.019046
1 0.133321
2 0.019046
3 0.019046
4 0.209505
Ratios for avg_transaction_30_days_ratio
Agency_Name Merchant_Category \
0 ATTORNEY GENERAL ADVERTISING SERVICES
1 ATTORNEY GENERAL AMERICAN AIRLINES
2 ATTORNEY GENERAL BUSINESS SERVICES NOT ELSEWHERE CLASSIFIED
3 ATTORNEY GENERAL CABLE, SATELLITE, AND OTHER PAY TELEVISION AND...
4 ATTORNEY GENERAL CHARITABLE AND SOCIAL SERVICE ORGANIZATIONS
Avg_Transaction_Count Avg_Transaction_30_Days_Ratio
0 1 0.019046
1 7 0.133321
2 7 0.133321
3 1 0.019046
4 3 0.057138
Ratios for avg_transaction_60_days_ratio
Agency_Name Merchant_Category \
0 ATTORNEY GENERAL ADVERTISING SERVICES
1 ATTORNEY GENERAL AMERICAN AIRLINES
2 ATTORNEY GENERAL BOOK STORES
3 ATTORNEY GENERAL BUSINESS SERVICES NOT ELSEWHERE CLASSIFIED
4 ATTORNEY GENERAL CABLE, SATELLITE, AND OTHER PAY TELEVISION AND...
Avg_Transaction_Count Avg_Transaction_60_Days_Ratio
0 1 0.019046
1 16 0.304734
2 2 0.038092
3 11 0.209505
4 3 0.057138
Ratios for avg_transaction_90_days_ratio
Agency_Name Merchant_Category \
0 ATTORNEY GENERAL ADVERTISING SERVICES
1 ATTORNEY GENERAL AMERICAN AIRLINES
2 ATTORNEY GENERAL AUTOMOTIVE PARTS AND ACCESSORIES STORES
3 ATTORNEY GENERAL BOOK STORES
4 ATTORNEY GENERAL BUSINESS SERVICES NOT ELSEWHERE CLASSIFIED
Avg_Transaction_Count Avg_Transaction_90_Days_Ratio
0 1 0.019046
1 30 0.571376
2 2 0.038092
3 2 0.038092
4 13 0.247596
Ratios for avg_amount_vendor_ratio
Agency_Name Vendor Avg_Amount \
0 ARDMORE HIGHER EDUCATION CENTER AGRI PRODUCTS 92.500000
1 ARDMORE HIGHER EDUCATION CENTER CDW GOVERNMENT -53.344286
2 ARDMORE HIGHER EDUCATION CENTER COUNTY BUILDING CE 58.440000
3 ARDMORE HIGHER EDUCATION CENTER EMPIRE PAPER COMPANY 443.540000
4 ARDMORE HIGHER EDUCATION CENTER GODADDY.COM 18.340000
Avg_Amount_Vendor_Ratio
0 0.206872
1 -0.119302
2 0.130699
3 0.991959
4 0.041017
Ratios for avg_amount_vendor_30_days_ratio
Agency_Name Vendor Avg_Amount \
0 ATTORNEY GENERAL ADI ASPEN PUBLISHERS 866.20
1 ATTORNEY GENERAL AMERICAN AI 0017450939750 -250.03
2 ATTORNEY GENERAL AMERICAN AI 0017456015142 982.00
3 ATTORNEY GENERAL AMERICAN AI 0017459356062 669.00
4 ATTORNEY GENERAL AMERICAN AI 0017459356079 687.50
Avg_Amount_Vendor_30_Days_Ratio
0 1.937220
1 -0.559182
2 2.196202
3 1.496191
4 1.537565
Ratios for avg_amount_vendor_60_days_ratio
Agency_Name Vendor Avg_Amount \
0 ATTORNEY GENERAL ADI ASPEN PUBLISHERS 866.200
1 ATTORNEY GENERAL AMAZON MKTPLACE PMTS 123.225
2 ATTORNEY GENERAL AMERICAN AI 0017450939750 205.985
3 ATTORNEY GENERAL AMERICAN AI 0017450939785 -361.000
4 ATTORNEY GENERAL AMERICAN AI 0017450939901 250.000
Avg_Amount_Vendor_60_Days_Ratio
0 1.937220
1 0.275588
2 0.460677
3 -0.807361
4 0.559115
Ratios for avg_amount_vendor_90_days_ratio
Agency_Name Vendor Avg_Amount \
0 ATTORNEY GENERAL ADI ASPEN PUBLISHERS 866.200
1 ATTORNEY GENERAL AMAZON MKTPLACE PMTS 123.225
2 ATTORNEY GENERAL AMERICAN AI 0017394725288 444.000
3 ATTORNEY GENERAL AMERICAN AI 0017395886798 673.000
4 ATTORNEY GENERAL AMERICAN AI 0017395886825 0.000
Avg_Amount_Vendor_90_Days_Ratio
0 1.937220
1 0.275588
2 0.992987
3 1.505136
4 0.000000
3.3 Descriptive Statistics of Features¶
# Function to display descriptive statistics of ratios
def display_ratio_stats(ratios):
stats = {}
for key, value in ratios.items():
stats[key] = value.describe()
# Combine statistics into a single dataframe for easier viewing
stats_df = pd.concat(stats, axis=1)
return stats_df
# Display descriptive statistics of ratios
ratio_stats = display_ratio_stats(ratios)
print("Descriptive Statistics of Ratios:", '\n')
print(ratio_stats)
Descriptive Statistics of Ratios:
avg_amount_ratio avg_amount_30_days_ratio \
Avg_Amount Avg_Amount_Ratio Avg_Amount
count 8427.000000 8427.000000 3372.000000
mean 526.650020 1.000000 529.500247
std 2412.571766 4.580977 1304.567007
min -3047.263333 -5.786126 -2393.700000
25% 111.995000 0.212655 86.884113
50% 269.450000 0.511630 240.048810
75% 527.840500 1.002260 556.442353
max 171619.610000 325.870319 45784.640000
avg_amount_60_days_ratio \
Avg_Amount_30_Days_Ratio Avg_Amount
count 3372.000000 4420.000000
mean 1.005412 542.288064
std 2.477104 2266.461185
min -4.545144 -995.600000
25% 0.164975 96.037500
50% 0.455803 265.452500
75% 1.056570 551.635076
max 86.935609 134734.050000
avg_amount_90_days_ratio \
Avg_Amount_60_Days_Ratio Avg_Amount
count 4420.000000 5258.000000
mean 1.029693 538.989403
std 4.303543 3089.953297
min -1.890440 -842.720000
25% 0.182355 102.350694
50% 0.504040 270.000000
75% 1.047441 555.795938
max 255.832232 214206.120000
max_amount_30_days_ratio \
Avg_Amount_90_Days_Ratio Max_Amount
count 5258.000000 3372.000000
mean 1.023430 1552.831412
std 5.867185 5522.956927
min -1.600152 -2393.700000
25% 0.194343 118.747500
50% 0.512674 425.155000
75% 1.055342 1188.075000
max 406.733337 132790.140000
... avg_transaction_90_days_ratio \
Max_Amount_30_Days_Ratio ... Avg_Transaction_Count
count 3372.000000 ... 5258.000000
mean 2.948507 ... 21.052301
std 10.486959 ... 93.690020
min -4.545144 ... 1.000000
25% 0.225477 ... 1.000000
50% 0.807282 ... 3.000000
75% 2.255910 ... 10.000000
max 252.141147 ... 2146.000000
avg_amount_vendor_ratio \
Avg_Transaction_90_Days_Ratio Avg_Amount
count 5258.000000 1.119170e+05
mean 0.400960 4.471355e+02
std 1.784408 6.534493e+03
min 0.019046 -4.907220e+03
25% 0.019046 2.900000e+01
50% 0.057138 1.561667e+02
75% 0.190459 4.560000e+02
max 40.872449 1.827119e+06
avg_amount_vendor_30_days_ratio \
Avg_Amount_Vendor_Ratio Avg_Amount
count 111917.000000 14965.000000
mean 1.000000 507.333145
std 14.614122 1852.339922
min -10.974793 -4987.040000
25% 0.064857 40.680000
50% 0.349260 162.802778
75% 1.019825 474.000000
max 4086.275878 60347.380000
avg_amount_vendor_60_days_ratio \
Avg_Amount_Vendor_30_Days_Ratio Avg_Amount
count 14965.000000 26124.000000
mean 1.134629 492.806725
std 4.142681 2992.894701
min -11.153307 -4999.000000
25% 0.090979 37.000000
50% 0.364102 162.000000
75% 1.060081 475.000000
max 134.964400 348053.750000
avg_amount_vendor_90_days_ratio \
Avg_Amount_Vendor_60_Days_Ratio Avg_Amount
count 26124.000000 37443.000000
mean 1.102142 467.539420
std 6.693484 2442.289144
min -11.180055 -3889.570000
25% 0.082749 36.185000
50% 0.362306 164.430000
75% 1.062318 478.858333
max 778.407704 298416.860000
Avg_Amount_Vendor_90_Days_Ratio
count 37443.000000
mean 1.045632
std 5.462078
min -8.698861
25% 0.080926
50% 0.367741
75% 1.070947
max 667.396869
[8 rows x 30 columns]
3.4 Distribution of Features¶
# Define function to create and plot percentile bins
def plot_percentile_bins(df, var):
binned_var = var + '_bin'
percentile = [0, 0.01, 0.05, 0.2, 0.5, 0.8, 0.95, 0.99, 1]
df[binned_var] = pd.qcut(df[var], percentile, duplicates='drop')
# Create the bin labels
df['bin_label'] = df[binned_var].apply(lambda x: f'{x.left:.2f} - {x.right:.2f}')
# Count the occurrences in each bin
bin_counts = df['bin_label'].value_counts().sort_index().reset_index()
bin_counts.columns = ['bin_label', 'count']
# Create the plot
fig = px.bar(bin_counts, x='bin_label', y='count', labels={'bin_label': 'Quantile Range', 'count': 'Count'},
color=bin_counts['bin_label'], color_discrete_sequence=px.colors.sequential.Plotly3)
# Update the layout
fig.update_layout(
title=f'Distribution of {var} Ratio',
xaxis_title='Quantile Range',
yaxis_title='Count',
template='plotly_white'
)
# Show the plot
fig.show()
# Plot all ratio columns
for key, df_ratio in ratios.items():
ratio_col = df_ratio.columns[-1]
plot_percentile_bins(df_ratio, ratio_col)
Section 4: Histogram-Based Outlier Score (HBOS)¶
Approach:
- HBOS uses histograms to model the distribution of each feature independently. It divides the range of each feature into bins and computes the density of data points in each bin.
Scoring:
- The anomaly score is based on the inverse of the density; regions with lower density (fewer data points) are considered more anomalous.
Advantages:
- Simple and fast, particularly effective for univariate data or when features are independent.
Limitations:
- May not capture interactions between features effectively. Sensitive to bin width and outliers.
# Define functions
def count_stat(vector):
unique, counts = np.unique(vector, return_counts=True)
return dict(zip(unique, counts))
def descriptive_stat_threshold(df, pred_scores, threshold):
df['Anomaly_Score'] = pred_scores
df['Group'] = np.where(df['Anomaly_Score'] < threshold, 'Normal', 'Outlier')
# Calculate count and percentage
cnt = df.groupby('Group')['Anomaly_Score'].count().reset_index().rename(columns={'Anomaly_Score': 'Count'})
cnt['Count %'] = (cnt['Count'] / cnt['Count'].sum()) * 100
# Calculate mean statistics
stat = df.groupby('Group').mean().round(2).reset_index()
# Merge count and mean statistics
stat = cnt.merge(stat, on='Group')
return stat
# Initialize parameters for hyperparameter tuning
param_grid = {'n_bins': [600, 700, 800], 'contamination': [0.01, 0.02, 0.03, 0.04, 0.05], 'alpha': [0.1, 0.2, 0.5, 1.0]}
# Prepare to store results for each ratio
descriptive_stats = {}
# Apply HBOS to each ratio DataFrame and calculate descriptive statistics
for key, df in ratios.items():
ratio_col = [col for col in df.columns if col.endswith('Ratio')][0]
# Prepare the data for HBOS
X = df[[ratio_col]].values
# Split data into train and test sets
X_train, X_test = train_test_split(X, test_size=0.2, random_state=42)
# Standardize data
X_train_norm, X_test_norm = standardizer(X_train, X_test)
# Initialize HBOS model
hbos = HBOS()
# Perform grid search for hyperparameter tuning
grid_search = GridSearchCV(estimator=hbos, param_grid=param_grid, scoring='roc_auc')
grid_search.fit(X_train_norm)
# Get the best hyperparameters
best_params = grid_search.best_params_
# Use the best hyperparameters to train the final HBOS model
best_hbos = HBOS(**best_params)
best_hbos.fit(X_train_norm)
# Get the anomaly scores and predictions
y_train_scores = best_hbos.decision_function(X_train_norm)
y_train_pred = best_hbos.predict(X_train_norm)
y_test_scores = best_hbos.decision_function(X_test_norm)
y_test_pred = best_hbos.predict(X_test_norm)
# Plot histogram of training scores using Plotly
fig = make_subplots(rows=1, cols=2, subplot_titles=('Training Data', 'Test Data'))
fig.add_trace(
go.Histogram(x=y_train_scores, name='Training'),
row=1, col=1
)
fig.add_trace(
go.Histogram(x=y_test_scores, name='Test'),
row=1, col=2
)
fig.update_layout(
title=f"Outlier score for {ratio_col}",
xaxis_title="Outlier Score",
yaxis_title="Frequency",
showlegend=True,
template='seaborn',
height = 450, width = 1000,
paper_bgcolor='rgba(0,0,0,0)'
)
fig.show()
# Print threshold and count statistics
print(f"Best hyperparameters for {ratio_col}: {best_params}")
threshold = best_hbos.threshold_
print(f"The threshold for {ratio_col}: {threshold}")
print(f"The training data (counts) for {ratio_col}: {count_stat(y_train_pred)}")
print(f"The test data (counts) for {ratio_col}: {count_stat(y_test_pred)}", '\n')
# Get the descriptive statistics for outliers
stats_train = descriptive_stat_threshold(pd.DataFrame(X_train, columns=[ratio_col]), y_train_scores, threshold)
stats_test = descriptive_stat_threshold(pd.DataFrame(X_test, columns=[ratio_col]), y_test_scores, threshold)
# Store the statistics
descriptive_stats[ratio_col] = {'train': stats_train, 'test': stats_test}
# Print the descriptive statistics for each ratio
for ratio_col, stats in descriptive_stats.items():
print(f"Descriptive Statistics for {ratio_col} (Training Data):\n", stats['train'], "\n")
print(f"Descriptive Statistics for {ratio_col} (Test Data):\n", stats['test'], "\n")
Best hyperparameters for Avg_Amount_Ratio: {'alpha': 0.1, 'contamination': 0.01, 'n_bins': 600}
The threshold for Avg_Amount_Ratio: 3.155841153404613
The training data (counts) for Avg_Amount_Ratio: {0: 6677, 1: 64}
The test data (counts) for Avg_Amount_Ratio: {0: 1675, 1: 11}
Best hyperparameters for Avg_Amount_30_Days_Ratio: {'alpha': 0.1, 'contamination': 0.01, 'n_bins': 600}
The threshold for Avg_Amount_30_Days_Ratio: 3.230707405611403
The training data (counts) for Avg_Amount_30_Days_Ratio: {0: 2679, 1: 18}
The test data (counts) for Avg_Amount_30_Days_Ratio: {0: 661, 1: 14}
Best hyperparameters for Avg_Amount_60_Days_Ratio: {'alpha': 0.1, 'contamination': 0.01, 'n_bins': 600}
The threshold for Avg_Amount_60_Days_Ratio: 3.192610297265351
The training data (counts) for Avg_Amount_60_Days_Ratio: {0: 3501, 1: 35}
The test data (counts) for Avg_Amount_60_Days_Ratio: {0: 879, 1: 5}
Best hyperparameters for Avg_Amount_90_Days_Ratio: {'alpha': 0.1, 'contamination': 0.01, 'n_bins': 600}
The threshold for Avg_Amount_90_Days_Ratio: 3.1087878209224846
The training data (counts) for Avg_Amount_90_Days_Ratio: {0: 4164, 1: 42}
The test data (counts) for Avg_Amount_90_Days_Ratio: {0: 1039, 1: 13}
Best hyperparameters for Max_Amount_30_Days_Ratio: {'alpha': 0.1, 'contamination': 0.01, 'n_bins': 600}
The threshold for Max_Amount_30_Days_Ratio: 3.19238761707751
The training data (counts) for Max_Amount_30_Days_Ratio: {0: 2674, 1: 23}
The test data (counts) for Max_Amount_30_Days_Ratio: {0: 658, 1: 17}
Best hyperparameters for Max_Amount_60_Days_Ratio: {'alpha': 0.1, 'contamination': 0.01, 'n_bins': 600}
The threshold for Max_Amount_60_Days_Ratio: 3.2619749812774237
The training data (counts) for Max_Amount_60_Days_Ratio: {0: 3506, 1: 30}
The test data (counts) for Max_Amount_60_Days_Ratio: {0: 869, 1: 15}
Best hyperparameters for Max_Amount_90_Days_Ratio: {'alpha': 0.1, 'contamination': 0.01, 'n_bins': 600}
The threshold for Max_Amount_90_Days_Ratio: 3.2202460330833116
The training data (counts) for Max_Amount_90_Days_Ratio: {0: 4174, 1: 32}
The test data (counts) for Max_Amount_90_Days_Ratio: {0: 1041, 1: 11}
Best hyperparameters for Avg_Transaction_Ratio: {'alpha': 0.1, 'contamination': 0.01, 'n_bins': 600}
The threshold for Avg_Transaction_Ratio: 3.2269790186497085
The training data (counts) for Avg_Transaction_Ratio: {0: 6686, 1: 55}
The test data (counts) for Avg_Transaction_Ratio: {0: 1666, 1: 20}
Best hyperparameters for Avg_Transaction_30_Days_Ratio: {'alpha': 0.1, 'contamination': 0.01, 'n_bins': 600}
The threshold for Avg_Transaction_30_Days_Ratio: 3.14352593082513
The training data (counts) for Avg_Transaction_30_Days_Ratio: {0: 2674, 1: 23}
The test data (counts) for Avg_Transaction_30_Days_Ratio: {0: 652, 1: 23}
Best hyperparameters for Avg_Transaction_60_Days_Ratio: {'alpha': 0.1, 'contamination': 0.01, 'n_bins': 600}
The threshold for Avg_Transaction_60_Days_Ratio: 3.2052410942068272
The training data (counts) for Avg_Transaction_60_Days_Ratio: {0: 3508, 1: 28}
The test data (counts) for Avg_Transaction_60_Days_Ratio: {0: 860, 1: 24}
Best hyperparameters for Avg_Transaction_90_Days_Ratio: {'alpha': 0.1, 'contamination': 0.01, 'n_bins': 600}
The threshold for Avg_Transaction_90_Days_Ratio: 3.237319540474148
The training data (counts) for Avg_Transaction_90_Days_Ratio: {0: 4164, 1: 42}
The test data (counts) for Avg_Transaction_90_Days_Ratio: {0: 1031, 1: 21}
Best hyperparameters for Avg_Amount_Vendor_Ratio: {'alpha': 0.1, 'contamination': 0.01, 'n_bins': 600}
The threshold for Avg_Amount_Vendor_Ratio: 2.4561038287135952
The training data (counts) for Avg_Amount_Vendor_Ratio: {0: 88965, 1: 568}
The test data (counts) for Avg_Amount_Vendor_Ratio: {0: 22242, 1: 142}
Best hyperparameters for Avg_Amount_Vendor_30_Days_Ratio: {'alpha': 0.1, 'contamination': 0.01, 'n_bins': 600}
The threshold for Avg_Amount_Vendor_30_Days_Ratio: 3.183548674471341
The training data (counts) for Avg_Amount_Vendor_30_Days_Ratio: {0: 11856, 1: 116}
The test data (counts) for Avg_Amount_Vendor_30_Days_Ratio: {0: 2970, 1: 23}
Best hyperparameters for Avg_Amount_Vendor_60_Days_Ratio: {'alpha': 0.1, 'contamination': 0.01, 'n_bins': 600}
The threshold for Avg_Amount_Vendor_60_Days_Ratio: 3.0824624891173382
The training data (counts) for Avg_Amount_Vendor_60_Days_Ratio: {0: 20694, 1: 205}
The test data (counts) for Avg_Amount_Vendor_60_Days_Ratio: {0: 5175, 1: 50}
Best hyperparameters for Avg_Amount_Vendor_90_Days_Ratio: {'alpha': 0.1, 'contamination': 0.01, 'n_bins': 600}
The threshold for Avg_Amount_Vendor_90_Days_Ratio: 3.13566568912757
The training data (counts) for Avg_Amount_Vendor_90_Days_Ratio: {0: 29708, 1: 246}
The test data (counts) for Avg_Amount_Vendor_90_Days_Ratio: {0: 7435, 1: 54}
Descriptive Statistics for Avg_Amount_Ratio (Training Data):
Group Count Count % Avg_Amount_Ratio Anomaly_Score
0 Normal 6668 98.917075 0.81 -1.16
1 Outlier 73 1.082925 20.00 3.24
Descriptive Statistics for Avg_Amount_Ratio (Test Data):
Group Count Count % Avg_Amount_Ratio Anomaly_Score
0 Normal 1673 99.228944 0.83 -1.12
1 Outlier 13 0.771056 13.54 3.26
Descriptive Statistics for Avg_Amount_30_Days_Ratio (Training Data):
Group Count Count % Avg_Amount_30_Days_Ratio Anomaly_Score
0 Normal 2668 98.924731 0.88 -0.43
1 Outlier 29 1.075269 12.52 3.23
Descriptive Statistics for Avg_Amount_30_Days_Ratio (Test Data):
Group Count Count % Avg_Amount_30_Days_Ratio Anomaly_Score
0 Normal 656 97.185185 0.83 -0.43
1 Outlier 19 2.814815 7.71 3.28
Descriptive Statistics for Avg_Amount_60_Days_Ratio (Training Data):
Group Count Count % Avg_Amount_60_Days_Ratio Anomaly_Score
0 Normal 3489 98.670814 0.83 -1.00
1 Outlier 47 1.329186 17.90 3.24
Descriptive Statistics for Avg_Amount_60_Days_Ratio (Test Data):
Group Count Count % Avg_Amount_60_Days_Ratio Anomaly_Score
0 Normal 875 98.9819 0.84 -0.98
1 Outlier 9 1.0181 10.23 3.23
Descriptive Statistics for Avg_Amount_90_Days_Ratio (Training Data):
Group Count Count % Avg_Amount_90_Days_Ratio Anomaly_Score
0 Normal 4143 98.50214 0.81 -1.37
1 Outlier 63 1.49786 16.67 3.18
Descriptive Statistics for Avg_Amount_90_Days_Ratio (Test Data):
Group Count Count % Avg_Amount_90_Days_Ratio Anomaly_Score
0 Normal 1033 98.193916 0.80 -1.36
1 Outlier 19 1.806084 8.56 3.19
Descriptive Statistics for Max_Amount_30_Days_Ratio (Training Data):
Group Count Count % Max_Amount_30_Days_Ratio Anomaly_Score
0 Normal 2657 98.516871 2.16 -1.19
1 Outlier 40 1.483129 53.95 3.19
Descriptive Statistics for Max_Amount_30_Days_Ratio (Test Data):
Group Count Count % Max_Amount_30_Days_Ratio Anomaly_Score
0 Normal 657 97.333333 1.99 -1.21
1 Outlier 18 2.666667 40.39 3.26
Descriptive Statistics for Max_Amount_60_Days_Ratio (Training Data):
Group Count Count % Max_Amount_60_Days_Ratio Anomaly_Score
0 Normal 3497 98.897059 2.41 -1.32
1 Outlier 39 1.102941 97.28 3.26
Descriptive Statistics for Max_Amount_60_Days_Ratio (Test Data):
Group Count Count % Max_Amount_60_Days_Ratio Anomaly_Score
0 Normal 869 98.303167 2.09 -1.32
1 Outlier 15 1.696833 68.52 3.28
Descriptive Statistics for Max_Amount_90_Days_Ratio (Training Data):
Group Count Count % Max_Amount_90_Days_Ratio Anomaly_Score
0 Normal 4162 98.953875 2.49 -1.55
1 Outlier 44 1.046125 110.29 3.26
Descriptive Statistics for Max_Amount_90_Days_Ratio (Test Data):
Group Count Count % Max_Amount_90_Days_Ratio Anomaly_Score
0 Normal 1040 98.859316 2.48 -1.49
1 Outlier 12 1.140684 103.34 3.28
Descriptive Statistics for Avg_Transaction_Ratio (Training Data):
Group Count Count % Avg_Transaction_Ratio Anomaly_Score
0 Normal 6662 98.828067 0.55 -2.82
1 Outlier 79 1.171933 41.50 3.26
Descriptive Statistics for Avg_Transaction_Ratio (Test Data):
Group Count Count % Avg_Transaction_Ratio Anomaly_Score
0 Normal 1664 98.695136 0.46 -2.89
1 Outlier 22 1.304864 33.58 3.30
Descriptive Statistics for Avg_Transaction_30_Days_Ratio (Training Data):
Group Count Count % Avg_Transaction_30_Days_Ratio Anomaly_Score
0 Normal 2636 97.738228 0.13 -2.37
1 Outlier 61 2.261772 3.19 3.14
Descriptive Statistics for Avg_Transaction_30_Days_Ratio (Test Data):
Group Count Count % Avg_Transaction_30_Days_Ratio Anomaly_Score
0 Normal 647 95.851852 0.10 -2.50
1 Outlier 28 4.148148 2.58 3.25
Descriptive Statistics for Avg_Transaction_60_Days_Ratio (Training Data):
Group Count Count % Avg_Transaction_60_Days_Ratio Anomaly_Score
0 Normal 3499 98.95362 0.25 -2.29
1 Outlier 37 1.04638 5.34 3.21
Descriptive Statistics for Avg_Transaction_60_Days_Ratio (Test Data):
Group Count Count % Avg_Transaction_60_Days_Ratio Anomaly_Score
0 Normal 855 96.719457 0.15 -2.51
1 Outlier 29 3.280543 5.63 3.28
Descriptive Statistics for Avg_Transaction_90_Days_Ratio (Training Data):
Group Count Count % Avg_Transaction_90_Days_Ratio Anomaly_Score
0 Normal 4150 98.668569 0.23 -2.36
1 Outlier 56 1.331431 11.82 3.24
Descriptive Statistics for Avg_Transaction_90_Days_Ratio (Test Data):
Group Count Count % Avg_Transaction_90_Days_Ratio Anomaly_Score
0 Normal 1031 98.003802 0.23 -2.32
1 Outlier 21 1.996198 11.44 3.29
Descriptive Statistics for Avg_Amount_Vendor_Ratio (Training Data):
Group Count Count % Avg_Amount_Vendor_Ratio Anomaly_Score
0 Normal 85871 95.909888 0.59 -1.25
1 Outlier 3662 4.090112 10.90 2.58
Descriptive Statistics for Avg_Amount_Vendor_Ratio (Test Data):
Group Count Count % Avg_Amount_Vendor_Ratio Anomaly_Score
0 Normal 21468 95.907791 0.60 -1.25
1 Outlier 916 4.092209 9.09 2.58
Descriptive Statistics for Avg_Amount_Vendor_30_Days_Ratio (Training Data):
Group Count Count % Avg_Amount_Vendor_30_Days_Ratio Anomaly_Score
0 Normal 11842 98.914133 0.87 -0.86
1 Outlier 130 1.085867 25.95 3.27
Descriptive Statistics for Avg_Amount_Vendor_30_Days_Ratio (Test Data):
Group Count Count % Avg_Amount_Vendor_30_Days_Ratio Anomaly_Score
0 Normal 2966 99.097895 0.89 -0.81
1 Outlier 27 0.902105 22.71 3.28
Descriptive Statistics for Avg_Amount_Vendor_60_Days_Ratio (Training Data):
Group Count Count % Avg_Amount_Vendor_60_Days_Ratio Anomaly_Score
0 Normal 20616 98.645868 0.86 -0.95
1 Outlier 283 1.354132 17.82 3.19
Descriptive Statistics for Avg_Amount_Vendor_60_Days_Ratio (Test Data):
Group Count Count % Avg_Amount_Vendor_60_Days_Ratio Anomaly_Score
0 Normal 5160 98.755981 0.84 -0.96
1 Outlier 65 1.244019 26.89 3.20
Descriptive Statistics for Avg_Amount_Vendor_90_Days_Ratio (Training Data):
Group Count Count % Avg_Amount_Vendor_90_Days_Ratio Anomaly_Score
0 Normal 29629 98.915003 0.81 -0.74
1 Outlier 325 1.084997 23.53 3.23
Descriptive Statistics for Avg_Amount_Vendor_90_Days_Ratio (Test Data):
Group Count Count % Avg_Amount_Vendor_90_Days_Ratio Anomaly_Score
0 Normal 7409 98.931767 0.80 -0.75
1 Outlier 80 1.068233 17.85 3.21
Key Takeaways
Average Amount by Agency, Merchant:
The outliers from this category have significantly higher average amounts compared to the norm.
- Overall: Training data (1.08% outliers), Test data (0.77% outliers)
- 30 Days: Training data (1.08% outliers), Test data (2.81% outliers)
- 60 Days: Training data (1.33% outliers), Test data (1.02% outliers)
- 90 Days: Training data (1.50% outliers), Test data (1.81% outliers)
Transactions with high average amounts over the last 30 days are potential outliers. These may signal sudden changes in spending behavior, which could be indicative of special events or fraud. while higher averages across different time period indicated sustained increases in spending, it might be regular large transactions or signs of financial issues.
Max Amount by Agency, Merchant:
Maximum transaction amounts suggest instances of unusually large transactions which could be one-off large purchases or potential fraud.
- 30 Days: Training data (1.48% outliers), Test data (2.67% outliers)
- 60 Days: Training data (1.10% outliers), Test data (1.70% outliers)
- 90 Days: Training data (1.05% outliers), Test data (1.14% outliers)
Transactions over different time period can help in understanding major financial activitives or fraudulent transactions that need further investigation.
Average Transaction by Agency, Merchant:
Higher average transaction ratios suggest frequent high-value transactions, which might be normal for some businesses but could indicate unusual activity in others.
- Overall: Training data (1.17% outliers), Test data (1.30% outliers)
- 30 Days: Training data (2.26% outliers), Test data (4.15% outliers)
- 60 Days: Training data (1.05% outliers), Test data (3.28% outliers)
- 90 Days: Training data (1.33% outliers), Test data (2.00% outliers)
Outliers in average transactions over 30 days might reflect a sudden spike in transaction frequency or amount, requiring investigation into potential reasons such as promotions or fraud. For longer period, outliers could indicate changing business patterns either business growth or unsual activity.
Average Amoung by Agency, Vendor:
Similarly to Merchant, this category add granularity into the vendor level
- Overall: Training data (4.09% outliers), Test data (4.09% outliers)
- 30 Days: Training data (1.09% outliers), Test data (0.90% outliers)
- 60 Days: Training data (1.35% outliers), Test data (1.24% outliers)
- 90 Days: Training data (1.08% outliers), Test data (1.07% outliers)
Vendors with high average amounts over the last 30 days could be part of specific promotions or might indicate potential issues like overbilling or fraud.
Section 5: Empirical Cumulative Distribution-based Outlier Detection (ECOD) ¶
Approach:
- ECOD uses the empirical cumulative distribution function (ECDF) to model the data. It considers the rank of each data point in the cumulative distribution.
Scoring:
- The anomaly score is based on the empirical distribution of the data. Points far from the cumulative distribution curve are considered more anomalous.
Advantages:
- Non-parametric and can handle non-linear relationships and interactions between features.
Limitations:
- Can be computationally intensive for large datasets. Sensitive to the empirical distribution.
# Define a scorer for GridSearchCV
def ecod_scorer(y_true, y_pred):
return f1_score(y_true, y_pred)
# Apply ECOD to each ratio DataFrame and calculate descriptive statistics
for key, df in ratios.items():
ratio_col = [col for col in df.columns if col.endswith('Ratio')][0]
# Prepare the data for ECOD
X = df[[ratio_col]].values
# Split data into train and test sets
X_train, X_test = train_test_split(X, test_size=0.2, random_state=42)
# Define the parameter grid
param_grid = {
'contamination': [0.01, 0.03, 0.05, 0.1, 0.2]
}
# Initialize ECOD model
ecod = ECOD()
# Initialize GridSearchCV
grid_search = GridSearchCV(estimator=ecod, param_grid=param_grid, scoring=make_scorer(ecod_scorer), cv=3)
# Fit ECOD model on training data
grid_search.fit(X_train)
# Get the best model
best_ecod = grid_search.best_estimator_
# Get anomaly scores and predictions
y_train_scores = best_ecod.decision_function(X_train)
y_train_pred = best_ecod.predict(X_train)
y_test_scores = best_ecod.decision_function(X_test)
y_test_pred = best_ecod.predict(X_test)
# Print statistics
print(f"Best contamination: {grid_search.best_params_['contamination']}")
print(f"Training data (counts): {count_stat(y_train_pred)}")
print(f"Test data (counts): {count_stat(y_test_pred)}")
print(f"Threshold: {best_ecod.threshold_}")
# Plot histogram of ECOD scores using Plotly
fig = make_subplots(rows=1, cols=2, subplot_titles=('Training Data', 'Test Data'))
fig.add_trace(
go.Histogram(x=y_train_scores, name='Training'),
row=1, col=1
)
fig.add_trace(
go.Histogram(x=y_test_scores, name='Test'),
row=1, col=2
)
fig.update_layout(
title=f"ECOD scores for {ratio_col}",
xaxis_title="ECOD Score",
yaxis_title="Frequency",
showlegend=True,
template='seaborn',
height=450,
width=1000,
paper_bgcolor='rgba(0,0,0,0)'
)
fig.show()
# Get descriptive statistics for outliers
stats_train = descriptive_stat_threshold(pd.DataFrame(X_train, columns=[ratio_col]), y_train_scores, best_ecod.threshold_)
stats_test = descriptive_stat_threshold(pd.DataFrame(X_test, columns=[ratio_col]), y_test_scores, best_ecod.threshold_)
# Store the statistics
descriptive_stats[ratio_col] = {'train': stats_train, 'test': stats_test}
# Print descriptive statistics for each ratio
for ratio_col, stats in descriptive_stats.items():
print(f"Descriptive Statistics for {ratio_col} (Training Data):\n", stats['train'], "\n")
print(f"Descriptive Statistics for {ratio_col} (Test Data):\n", stats['test'], "\n")
Best contamination: 0.01
Training data (counts): {0: 6673, 1: 68}
Test data (counts): {0: 1672, 1: 14}
Threshold: 4.978719682275966
Best contamination: 0.01
Training data (counts): {0: 2671, 1: 26}
Test data (counts): {0: 670, 1: 5}
Threshold: 5.260837993524471
Best contamination: 0.01
Training data (counts): {0: 3500, 1: 36}
Test data (counts): {0: 878, 1: 6}
Threshold: 5.2614561384167775
Best contamination: 0.01
Training data (counts): {0: 4164, 1: 42}
Test data (counts): {0: 1039, 1: 13}
Threshold: 5.253224902904329
Best contamination: 0.01
Training data (counts): {0: 2671, 1: 26}
Test data (counts): {0: 670, 1: 5}
Threshold: 5.260837993524471
Best contamination: 0.01
Training data (counts): {0: 3500, 1: 36}
Test data (counts): {0: 874, 1: 10}
Threshold: 5.2614561384167775
Best contamination: 0.01
Training data (counts): {0: 4164, 1: 42}
Test data (counts): {0: 1040, 1: 12}
Threshold: 5.253224902904329
Best contamination: 0.01
Training data (counts): {0: 6673, 1: 68}
Test data (counts): {0: 1670, 1: 16}
Threshold: 4.5906163359088765
Best contamination: 0.01
Training data (counts): {0: 2670, 1: 27}
Test data (counts): {0: 668, 1: 7}
Threshold: 4.569145518731359
Best contamination: 0.01
Training data (counts): {0: 3500, 1: 36}
Test data (counts): {0: 874, 1: 10}
Threshold: 4.577642844335586
Best contamination: 0.01
Training data (counts): {0: 4163, 1: 43}
Test data (counts): {0: 1040, 1: 12}
Threshold: 4.5819177646578435
Best contamination: 0.01
Training data (counts): {0: 88637, 1: 896}
Test data (counts): {0: 22163, 1: 221}
Threshold: 5.296855829268316
Best contamination: 0.01
Training data (counts): {0: 11852, 1: 120}
Test data (counts): {0: 2965, 1: 28}
Threshold: 5.284245502365114
Best contamination: 0.01
Training data (counts): {0: 20691, 1: 208}
Test data (counts): {0: 5174, 1: 51}
Threshold: 5.293496239760728
Best contamination: 0.01
Training data (counts): {0: 29654, 1: 300}
Test data (counts): {0: 7423, 1: 66}
Threshold: 5.293261248815198
Descriptive Statistics for Avg_Amount_Ratio (Training Data):
Group Count Count % Avg_Amount_Ratio Anomaly_Score
0 Normal 6673 98.991248 0.83 1.64
1 Outlier 68 1.008752 19.53 6.15
Descriptive Statistics for Avg_Amount_Ratio (Test Data):
Group Count Count % Avg_Amount_Ratio Anomaly_Score
0 Normal 1672 99.169632 0.85 1.63
1 Outlier 14 0.830368 11.33 6.01
Descriptive Statistics for Avg_Amount_30_Days_Ratio (Training Data):
Group Count Count % Avg_Amount_30_Days_Ratio Anomaly_Score
0 Normal 2669 98.961809 0.88 1.64
1 Outlier 28 1.038191 12.68 6.10
Descriptive Statistics for Avg_Amount_30_Days_Ratio (Test Data):
Group Count Count % Avg_Amount_30_Days_Ratio Anomaly_Score
0 Normal 670 99.259259 0.95 1.66
1 Outlier 5 0.740741 10.97 6.67
Descriptive Statistics for Avg_Amount_60_Days_Ratio (Training Data):
Group Count Count % Avg_Amount_60_Days_Ratio Anomaly_Score
0 Normal 3500 98.9819 0.89 1.64
1 Outlier 36 1.0181 17.12 6.15
Descriptive Statistics for Avg_Amount_60_Days_Ratio (Test Data):
Group Count Count % Avg_Amount_60_Days_Ratio Anomaly_Score
0 Normal 878 99.321267 0.90 1.64
1 Outlier 6 0.678733 6.03 5.96
Descriptive Statistics for Avg_Amount_90_Days_Ratio (Training Data):
Group Count Count % Avg_Amount_90_Days_Ratio Anomaly_Score
0 Normal 4162 98.953875 0.87 1.64
1 Outlier 44 1.046125 17.79 6.14
Descriptive Statistics for Avg_Amount_90_Days_Ratio (Test Data):
Group Count Count % Avg_Amount_90_Days_Ratio Anomaly_Score
0 Normal 1039 98.764259 0.86 1.63
1 Outlier 13 1.235741 7.05 5.99
Descriptive Statistics for Max_Amount_30_Days_Ratio (Training Data):
Group Count Count % Max_Amount_30_Days_Ratio Anomaly_Score
0 Normal 2669 98.961809 2.33 1.64
1 Outlier 28 1.038191 60.12 6.10
Descriptive Statistics for Max_Amount_30_Days_Ratio (Test Data):
Group Count Count % Max_Amount_30_Days_Ratio Anomaly_Score
0 Normal 670 99.259259 2.44 1.65
1 Outlier 5 0.740741 80.52 6.32
Descriptive Statistics for Max_Amount_60_Days_Ratio (Training Data):
Group Count Count % Max_Amount_60_Days_Ratio Anomaly_Score
0 Normal 3500 98.9819 2.67 1.64
1 Outlier 36 1.0181 79.80 6.15
Descriptive Statistics for Max_Amount_60_Days_Ratio (Test Data):
Group Count Count % Max_Amount_60_Days_Ratio Anomaly_Score
0 Normal 874 98.868778 2.50 1.61
1 Outlier 10 1.131222 66.04 5.97
Descriptive Statistics for Max_Amount_90_Days_Ratio (Training Data):
Group Count Count % Max_Amount_90_Days_Ratio Anomaly_Score
0 Normal 4162 98.953875 2.77 1.64
1 Outlier 44 1.046125 83.89 6.14
Descriptive Statistics for Max_Amount_90_Days_Ratio (Test Data):
Group Count Count % Max_Amount_90_Days_Ratio Anomaly_Score
0 Normal 1040 98.859316 2.88 1.65
1 Outlier 12 1.140684 69.01 6.00
Descriptive Statistics for Avg_Transaction_Ratio (Training Data):
Group Count Count % Avg_Transaction_Ratio Anomaly_Score
0 Normal 6673 98.991248 0.56 1.38
1 Outlier 68 1.008752 46.73 5.55
Descriptive Statistics for Avg_Transaction_Ratio (Test Data):
Group Count Count % Avg_Transaction_Ratio Anomaly_Score
0 Normal 1670 99.051008 0.5 1.36
1 Outlier 16 0.948992 42.0 5.47
Descriptive Statistics for Avg_Transaction_30_Days_Ratio (Training Data):
Group Count Count % Avg_Transaction_30_Days_Ratio Anomaly_Score
0 Normal 2670 98.998888 0.14 1.21
1 Outlier 27 1.001112 6.10 5.51
Descriptive Statistics for Avg_Transaction_30_Days_Ratio (Test Data):
Group Count Count % Avg_Transaction_30_Days_Ratio Anomaly_Score
0 Normal 668 98.962963 0.15 1.22
1 Outlier 7 1.037037 5.26 5.29
Descriptive Statistics for Avg_Transaction_60_Days_Ratio (Training Data):
Group Count Count % Avg_Transaction_60_Days_Ratio Anomaly_Score
0 Normal 3500 98.9819 0.20 1.28
1 Outlier 36 1.0181 10.31 5.51
Descriptive Statistics for Avg_Transaction_60_Days_Ratio (Test Data):
Group Count Count % Avg_Transaction_60_Days_Ratio Anomaly_Score
0 Normal 874 98.868778 0.22 1.26
1 Outlier 10 1.131222 9.91 5.42
Descriptive Statistics for Avg_Transaction_90_Days_Ratio (Training Data):
Group Count Count % Avg_Transaction_90_Days_Ratio Anomaly_Score
0 Normal 4163 98.977651 0.25 1.31
1 Outlier 43 1.022349 14.08 5.52
Descriptive Statistics for Avg_Transaction_90_Days_Ratio (Test Data):
Group Count Count % Avg_Transaction_90_Days_Ratio Anomaly_Score
0 Normal 1040 98.859316 0.27 1.33
1 Outlier 12 1.140684 16.21 5.74
Descriptive Statistics for Avg_Amount_Vendor_Ratio (Training Data):
Group Count Count % Avg_Amount_Vendor_Ratio Anomaly_Score
0 Normal 88637 98.999252 0.79 1.62
1 Outlier 896 1.000748 23.13 6.29
Descriptive Statistics for Avg_Amount_Vendor_Ratio (Test Data):
Group Count Count % Avg_Amount_Vendor_Ratio Anomaly_Score
0 Normal 22163 99.012688 0.79 1.62
1 Outlier 221 0.987312 16.95 6.35
Descriptive Statistics for Avg_Amount_Vendor_30_Days_Ratio (Training Data):
Group Count Count % Avg_Amount_Vendor_30_Days_Ratio Anomaly_Score
0 Normal 11852 98.997661 0.93 1.64
1 Outlier 120 1.002339 22.62 6.24
Descriptive Statistics for Avg_Amount_Vendor_30_Days_Ratio (Test Data):
Group Count Count % Avg_Amount_Vendor_30_Days_Ratio Anomaly_Score
0 Normal 2965 99.064484 0.93 1.64
1 Outlier 28 0.935516 17.28 6.39
Descriptive Statistics for Avg_Amount_Vendor_60_Days_Ratio (Training Data):
Group Count Count % Avg_Amount_Vendor_60_Days_Ratio Anomaly_Score
0 Normal 20689 98.995167 0.89 1.63
1 Outlier 210 1.004833 20.21 6.26
Descriptive Statistics for Avg_Amount_Vendor_60_Days_Ratio (Test Data):
Group Count Count % Avg_Amount_Vendor_60_Days_Ratio Anomaly_Score
0 Normal 5174 99.023923 0.87 1.62
1 Outlier 51 0.976077 30.93 6.26
Descriptive Statistics for Avg_Amount_Vendor_90_Days_Ratio (Training Data):
Group Count Count % Avg_Amount_Vendor_90_Days_Ratio Anomaly_Score
0 Normal 29654 98.998464 0.87 1.63
1 Outlier 300 1.001536 19.50 6.27
Descriptive Statistics for Avg_Amount_Vendor_90_Days_Ratio (Test Data):
Group Count Count % Avg_Amount_Vendor_90_Days_Ratio Anomaly_Score
0 Normal 7423 99.118707 0.87 1.63
1 Outlier 66 0.881293 14.33 6.18
Key Takeaways
Average Amount by Agency, Merchant:
- Overall: Training data (1.01% outliers), Test data (0.83% outliers)
- 30 Days: Training data (1.04% outliers), Test data (0.74% outliers)
- 60 Days: Training data (1.02% outliers), Test data (0.68% outliers)
- 90 Days: Training data (1.05% outliers), Test data (1.24% outliers)
Max Amount by Agency, Merchant:
- 30 Days: Training data (1.04% outliers), Test data (0.74% outliers)
- 60 Days: Training data (1.02% outliers), Test data (1.13% outliers)
- 90 Days: Training data (1.05% outliers), Test data (1.14% outliers)
Average Transaction by Agency, Merchant:
- Overall: Training data (1.01% outliers), Test data (0.95% outliers)
- 30 Days: Training data (1.00% outliers), Test data (1.04% outliers)
- 60 Days: Training data (1.02% outliers), Test data (1.13% outliers)
- 90 Days: Training data (1.02% outliers), Test data (1.14% outliers)
Average Amoung by Agency, Vendor:
- Overall: Training data (1.00% outliers), Test data (0.99% outliers)
- 30 Days: Training data (1.00% outliers), Test data (0.94% outliers)
- 60 Days: Training data (1.00% outliers), Test data (0.98% outliers)
- 90 Days: Training data (1.00% outliers), Test data (0.88% outliers)
Section 6: Models Predictions Comparison¶
# HBOS
hbos = HBOS(n_bins=600, contamination=0.01)
hbos.fit(X_train)
y_train_hbos_pred = hbos.labels_
y_test_hbos_pred = hbos.predict(X_test)
y_train_hbos_scores = hbos.decision_function(X_train)
y_test_hbos_scores = hbos.decision_function(X_test)
# ECOD
ecod = ECOD(contamination=0.01)
ecod.fit(X_train)
y_train_ecod_pred = ecod.labels_
y_test_ecod_pred = ecod.predict(X_test)
y_train_ecod_scores = ecod.decision_scores_ # raw outlier scores
y_test_ecod_scores = ecod.decision_function(X_test)
# Thresholds
thresholds = [ecod.threshold_, hbos.threshold_]
print("Thresholds for ECOD and HBOS:", thresholds, '\n')
# Compare HBOS and ECOD predictions
comparison_df = pd.DataFrame({'HBOS_pred': y_test_hbos_pred, 'ECOD_pred': y_test_ecod_pred})
print(pd.crosstab(comparison_df['HBOS_pred'], comparison_df['ECOD_pred']))
Thresholds for ECOD and HBOS: [5.293261248815198, 3.2885504005909816] ECOD_pred 0 1 HBOS_pred 0 7405 30 1 18 36
# Compute confusion matrix
cm = pd.crosstab(y_test_hbos_pred, y_test_ecod_pred)
# Plot confusion matrix as heatmap
trace2 = go.Heatmap(z=cm,
x=['Predicted 0', 'Predicted 1'],
y=['True 0', 'True 1'],
showscale=False,
colorscale=[
[0.0, "#e3e7e6"],
[1.0, "#eac06c"]
],
xgap=20,
ygap=20,
text=cm,
texttemplate="%{text}")
# Define layout
layout = go.Layout(
title=dict(text="Confusion Matrix between HBOS and ECOD", x=0.5, y=0.9, xanchor='center', yanchor='top'),
xaxis=dict(title='Predicted label', showticklabels=True),
yaxis=dict(title='True label', showticklabels=True),
autosize=False,
width=500,
height=500
)
# Plot heatmap
fig = go.Figure(data=[trace2], layout=layout)
fig.show()
Key Takeways from Crosstab:
Agree on Normal:
- Both HBOS and ECOD predicted 7405 instances as normal (0)
Agree on Outliers:
- Both HBOS and ECOD predicted 36 instances as outliers (1)
Discrepancy in Predictions:
- There are 30 instances where HBOS predicted normal (0), but ECOD predicted outliers (1)
- There are 18 instances where HBOS predicted outliers (1), but ECOD predicted normal (0). This indicates that HBOS is more sensitive, potentially identifying more outliers but including some that ECOD considers normal.
The high count (7405) of instances predicted as normal by both models indicates a strong agreement on what constitutes normal behavior.
Both models agreed on 36 instances being outliers. This agreement suggests that these instances are likely true outliers, as both models identified them as such despite their different methodologies.
ECOD is stricter with a higher threshold, leading to fewer false positives and potentially misses some outliers that HBOS catches (higher false negatives). On the other hand, HBOS is more sensitive with a lower threshold, leading to more outliers detected, including some instances that ECOD considers normal (higher false postives)
Section 7: Conclusion¶
Detection Method Comparison:
HBOS tends to identify fewer outliers with scores more tightly clustered around the center. ECOD detects a broader range of outliers, showing more granularity in the outlier scores.
Outliers Locations:
Outliers are generally detected with higher scores (above 2 for HBOS and above 5 for ECOD). These outliers are found at the tails of the distribution in both training and test datasets.
In the context of credit card transaction anomaly detection, both models have their strengths and weaknesses. ECOD is more aggressive in flagging anomalies, which might be better for catching more fraudulent transactions at the cost of higher false positives. HBOS, being more conservative, might result in fewer disruptions to legitimate transactions but may miss some anomalies. Depending on the specific goals and risk tolerance of the anomaly detection system, either model, or a combination of both, could be employed to enhance the overall detection capability.